  • Global use for directory and subdirectory paths | Alternatives

    Hello all:

    Before I pose my query, I will say I have seen several words of caution about not using globals unless absolutely necessary. My objective below is to develop a program that takes a directory name as its argument and sets the working directory to my project at hand, with globals pointing to subdirectories or to other directories near or above the project root directory. I want to -include- do-files from different directories at different, discontinuous chunks of the project master do-file (where I presume locals would not work, unless I repeatedly declare them within each chunk's scope).

    Is the structure below really unstable? I am planning to have all future projects use the same subdirectory names, and to call this one program for each project I work on. Do any of you foresee major issues? I have 3 different PCs, my student has a Mac, and we work off Dropbox. I am fairly certain that I would not reuse the global names as local macro names, but I would like to know what could go wrong. If there is an easier, stable way without globals, I am happy to take pointers.


    Code:
    *! Program dropbox destinations 1.0
    * Sets Dropbox for all 3 PC and MACs
    * Define working directory for Project folder
    * Argument to set project folder as working dir
    *-----------------------------------
    cap program drop stemmer
    program define stemmer 
    
    *Stata versions
        version 17
        clear all
        set linesize 80
    
        set more off
        frames reset
    cap drop __000*
    cls
    
    args prfolder
    
    // Define all 4 roots directories to dropbox across Emily and GV PCs
    *-------------------------------------------------------------------
    cap cd "/Users/emilys/Dropbox/"
    cap cd "C:/Users/Vijayalakshmi/Dropbox/"
    cap cd "C:/Users/NewPC/Dropbox/"
    cap cd "D:/gvenkataraman/Documents/Dropbox/"
    
    
    *-------------DEFINE STEM DIR---------------------------*
    
    local stem  `c(pwd)'   // Everything leading upto Dropbox dir
    di "`stem'"
    
    
    *---------DEFINE WORKING & SAVING DIR-------------------*
    
    if regexm(c(os), "Mac") == 1 {
        global working     "`stem'//gv_includes/"
        global prfolder "`stem'//`prfolder'/"
        global output     "${prfolder}//outputs"
        global prdofile "${prfolder}//dofiles"
                }
    else if regexm(c(os), "Windows") == 1 {
        global working     "`stem'//Girish Files/gv_includes/"
        global prfolder "`stem'//Girish Files/A_UOC academic/`prfolder'"
        global output     "${prfolder}//outputs"
        global prdofile    "${prfolder}//dofiles"
        }
    *------------CHANGE WORKING DIR to Project ROOT----------*    
    cd "${working}"
    
    di "Working directory (working)        : $working"
    di "Project dir (prfolder)            : $prfolder"
    di "Output dir (outputs)            : $output"
    di "Project do file dir (prdofile)     : $prdofile"    
    cd "$prfolder"
    end
    Sample call in master do.file:
    Code:
    stemmer "Prj_akp53"
    
    use "$prfolder//p53merged-v2.dta"
    qui include "./p53select.doh"
    
    
    ***Several lines down further
    sts graph....
    graph export "${output}//testgraph.png", replace

  • #2
    I'm the person who most frequently warns against using global macros around here.

    Look, you will get away with it nearly all the time. The problem is that in the infrequent situation where it goes wrong, it goes disastrously wrong, and finding the source of the problem is extremely difficult and frustrating. The way it can go wrong is if somebody runs another program that uses a global macro with the same name as one of the macros you have defined. That program might misread its global macro as having the value of the one you defined. Or the other way around: it might change the value of your global macro, which you then use. And what makes it so frustrating to deal with is that you might not even know that this other program exists: it could be a program called within a program called within a program, and you are only aware of the existence of the top-level program (which might, for example, be a user-written program that you use without knowing the details of its innards). The effects of a global macro whose contents have been switched in part of the code you don't see can be bizarre and protean, and finding the source of the problem can be a Gordian knot. I would be particularly concerned about global macros with names like working and output, which might be in common use. prfolder and prdofile are a bit less worrisome.

    If I were in your situation, I would make an include file that defines all of these things as local macros (and, crucially, not inside a program).
    Code:
    // BEGIN DIRECTORY_INITIALIZATION.DOH
    * Sets Dropbox for all 3 PC and MACs
    * Define working directory for Project folder
    * Argument to set project folder as working dir
    *-----------------------------------
    *Stata versions
        version 17
        clear all
        set linesize 80
    
        set more off
        frames reset
    cap drop __000*
    cls
    
    
    // Define all 4 roots directories to dropbox across Emily and GV PCs
    *-------------------------------------------------------------------
    cap cd "/Users/emilys/Dropbox/"
    cap cd "C:/Users/Vijayalakshmi/Dropbox/"
    cap cd "C:/Users/NewPC/Dropbox/"
    cap cd "D:/gvenkataraman/Documents/Dropbox/"
    
    
    *-------------DEFINE STEM DIR---------------------------*
    
    local stem  `c(pwd)'   // Everything leading upto Dropbox dir
    di "`stem'"
    
    
    *---------DEFINE WORKING & SAVING DIR-------------------*
    
    if regexm(c(os), "Mac") == 1 {
        local working     "`stem'//gv_includes/"
        local prfolder "`stem'//`prfolder'/"
        local output     "`prfolder'//outputs"
        local prdofile "`prfolder'//dofiles"
                }
    else if regexm(c(os), "Windows") == 1 {
        local working     "`stem'//Girish Files/gv_includes/"
        local prfolder "`stem'//Girish Files/A_UOC academic/`prfolder'"
        local output     "`prfolder'//outputs"
        local prdofile    "`prfolder'//dofiles"
        }
    *------------CHANGE WORKING DIR to Project ROOT----------*    
    cd "`working'"
    
    di "Working directory (working)        : `working'"
    di "Project dir (prfolder)            : `prfolder'"
    di "Output dir (outputs)            : `output'"
    di "Project do file dir (prdofile)     : `prdofile'"    
    cd "`prfolder'"
    // END OF CONTENTS FOR DIRECTORY_INITIALIZATION.DOH
    // SAVE IN SEPARATE FILE DIRECTORY_INITIALIZATION.DOH
    
    Sample call in master do.file:
    
    local prfolder Prj_akp53
    include DIRECTORY_INITIALIZATION.DOH
    use "`prfolder'//p53merged-v2.dta"
    qui include "./p53select.doh" // NOT SURE WHAT THIS IS OR HOW IT RELATES TO THIS
    
    
    ***Several lines down further
    sts graph....
    graph export "`output'//testgraph.png", replace
    Notes:
    1. To ensure that DIRECTORY_INITIALIZATION.DOH is available to everybody, it is probably best to put it in the PERSONAL folder of the adopath.
    2. I'm not sure what all the // digraphs in the pathnames are about. Is that a Mac thing? I would think that single / would do in all cases--but not knowing for sure, I left these things the way you had them.




    • #3
      What you are doing here is completely reasonable in my opinion. Global variables are only a problem when different consumers of the global mutate the content of the global. Suppose I have the following do file:

      Code:
      global problem_global = 0
      
      forv i = 1/10{
          global problem_global = $problem_global + 2
          display i + $problem_global
      }
      Now, suppose I run this code, but I get a syntax error on the display line because I forgot to wrap i in local macro quotes (it should be `i'). Okay, no problem: I fix the syntax error, highlight the loop, then hit Ctrl+D to run that block of code again. The problem is that the last time I ran this code I added to problem_global before I hit the error, so the state has already changed, and because I forgot to highlight the line with the global initialization, it isn't reset. So now I'm printing incorrect results. In this toy example it should be pretty easy to figure out what happened, but as programs become more complex, this kind of issue becomes much more of a problem. Globals can ultimately become a serious headache to debug, or worse, you could get incorrect results without realizing they are incorrect. If I had used a local for problem_global, I would have got a syntax error when I tried to run the forv loop by itself the second time.

      Stata actually globalizes data all of the time. Variables are stored in the global environment, and the language does quite a bit to try to mitigate the kind of errors associated with global data. This is why you need the "clear" option to overwrite data with the -use- command for instance. Alternatively, suppose I now have three string variables I'd like to convert to a number.

      Code:
      foreach var in 1 2 3{
          destring string`var', gen(num`var')
      }
      But now suppose variable 3 contains some string value that can't be interpreted as a number, like "." or "N/A". I can go back and replace those values so that they are the empty string and treated as missing. Now if I run that loop again, Stata will give me an error telling me that the variable num1 already exists. That's because we already created the variable in the loop. I might think this is kind of annoying because now I have to explicitly drop those two variables before I can run my loop again. Stata doesn't let the user do this in order to avoid some of the problems associated with globalized data. Rather than allowing you to just hammer over global data with the -generate- command, Stata requires that the user prove they know the variable already exists with the -replace- command.

      In contrast, your command effectively defines a series of constants, which should never change over the runtime of the program. Since these globals are constant with respect to the state of the current user, you avoid essentially all of the problems associated with global variables. I like this design. I like that you print the relevant data to the console. That should make things a bit easier to debug, if need be.

      I do think you have a problem with this setup, but it isn't related to globals. Suppose two people, person A and person B, are working on the same do file at the same time without realizing it. Person A makes some changes to the do file, then saves those changes to the disk before leaving for the night. Person B already has the original do file loaded in memory. Person B makes a series of changes to the do file, then saves those changes to the disk. Now person A's changes are effectively erased. The scalable industry way to solve this problem is to use git and a shared git repository rather than dropbox. In a small team like this, you can mitigate this issue with good communication, by making sure everyone is clear on who has the ownership of which files at any given time, and by looking out for this kind of thing understanding that it might still happen occasionally.

      Other than that, I think you're good to go!
      Last edited by Daniel Schaefer; 04 Oct 2023, 12:47.



      • #4
        Regarding note 2 from #2: Not a mac thing as far as I'm aware, and I'm posting from a mac. macOS is unix based, supports BASH, and should have essentially the same filesystem conventions as linux. It reminds me of double backslash to escape the escape character, but that is neither necessary in Stata nor applicable here.



        • #5
          As much as I like the solution in OP, I agree that the solution in #2 is better practice and more scalable in general. I think the solution in the OP works (and might be a bit more convenient than #2) so long as it is designed with a small in-house team in mind.



          • #6
            In contrast, your command effectively defines a series of constants, which should never change over the runtime of the program. Since these globals are constant with respect to the state of the current user, you avoid essentially all of the problems associated with global variables.
            The risk is that somewhere in a program that you call, which in turn calls another program, which in turn calls another ..., somebody else has used a global macro whose name matches one of yours. That program may change the value of your global, or it may rely on the value having been set differently and then itself malfunction. Globals with names like output or working are especially vulnerable prey.

            If it were difficult to work around this, say, requiring complicated error-prone code, then one might take this risk. But as you can see, doing this with an -include- file is just as easy. It doesn't even add a line of code. It avoids a low-probability risk that can be extremely problematic if it occurs, and does so at zero cost.
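
            A minimal sketch of the collision described above, not taken from anyone's actual code: the program name othertool and both paths are hypothetical, standing in for a user-written command whose internals you never see.

            ```stata
            * Hypothetical helper written by someone else. The name othertool and
            * its use of a global called output are invented for illustration.
            capture program drop othertool
            program define othertool
                global output "C:/temp/scratch"   // silently overwrites the caller's global
            end

            * Caller sets a global, then calls the helper without knowing its innards.
            global output "D:/project/outputs"
            display "global before call: $output"
            othertool
            display "global after call:  $output"

            * Same pattern with a local macro: the program cannot see or change it.
            local output "D:/project/outputs"
            othertool
            display "local after call:   `output'"
            ```

            Run as one do-file, the second display shows the scratch path, while the local still holds the project path, since a program has no access to its caller's local macros.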



            • #7
              Originally posted by Clyde Schechter (#2)

              Glad to get your take on this, Clyde Schechter. I will try this alternate method and see if it works. Good to know that the // is called a digraph. I picked it up from some code in earlier posts; it seemed to work and provided some separation between the preceding local/global path and the rest of the path that follows. I had no specific reason for picking it.



              • #8
                Clyde Schechter you make a compelling argument. I was thinking that the header file would need to be reloaded each time any block of code is executed because the locals would be dropped. I was mistaken about this. From the documentation:

                include differs from do and run in that any local macros (changed settings, etc.) created by executing the file are not dropped or reset when execution of the file concludes. Rather, results are just as if the commands in filename appeared in the session or file that included filename.
                That said, I'm not sure using a header file avoids all of the problems associated with global variables. I may be wrong here, but it seems like it should be possible to make a set of commands that share a header file. The header file might gather some metadata about the environment at load-time, which might be used by each of those commands at runtime. If that header file uses locals with the same name as the ones in #2, then you might still be unwittingly overwriting important data.

                I continue to think using headers like this to create persistent locals would be unwise if you are mutating the macro at runtime.



                • #9
                  Thanks again, Daniel Schaefer, for the lucid example. I needed that. I am leaning towards Clyde's method, especially since I am fairly comfortable with include files (I love 'em and almost overuse them, thanks to Clyde's suggestion several months ago), not to mention that I barely know globals and know even less about wrapping them stably into programs. To clarify Clyde's query about the line from my code shown below: the .doh file selects the right data subset for the study (after exclusions) and stsets the data at the same time.

                  Code:
                   
                   qui include "./p53select.doh" // NOT SURE WHAT THIS IS OR HOW IT RELATES TO THIS



                  • #10
                    Clyde Schechter: I tried the method you suggested, saving everything up to the end of the initialization code into the .doh file and putting it in the ado/personal folder. When I call the initialization in a new session it throws an r(601) error (file not found). I don't see any typos in the file names or issues with the file in the personal path. What could I be missing?

                    Code:
                    . ls "C:\Users\gvenkataraman\ado\personal/"
                      <dir>  10/04/23 15:52  .                 
                      <dir>  10/04/23 15:52  ..                
                       1.6k  10/04/23 15:48  DIRECTORY_INITIALIZATION.DOH
                      <dir>   8/11/23 13:49  grec              
                       0.6k   5/24/23 11:15  gv.ado            
                    
                    . do "C:\Users\GVENKA~1\AppData\Local\Temp\STD5478_000000.tmp"
                    
                    . local prfolder Prj_akp53
                    
                    . include "DIRECTORY_INITIALIZATION.DOH"
                    file DIRECTORY_INITIALIZATION.DOH not found
                    r(601);



                    • #11
                      I am leaning towards Clyde's method
                      Girish Venkataraman What can I say: as a rule it's hard to go wrong by following Clyde's advice. Headers seem to be built for exactly the situation you have here. It's a good idea.



                      • #12
                        Code:
                        include "DIRECTORY_INITIALIZATION.DOH", adopath
                        Sorry I forgot to put that in #2.



                        • #13
                          Originally posted by Clyde Schechter
                          Code:
                          include "DIRECTORY_INITIALIZATION.DOH", adopath
                          Sorry I forgot to put that in #2.
                          That got it working, Clyde Schechter. However, I am having to keep repeating the initialization chunk below every time I want to point to a different directory higher up to grab an include file or change directory to a downstream folder to save a figure. The scope still seems limited within the working do file in a new session. I somehow presumed I could call all the local-ized directories (in a global manner) repeatedly through the do file without having to reinitialize at every instance. This is more so for the putdocx manuscript result files where I conduct repeated analysis line-by-line using several -include- files higher up from the project folder location.


                          Code:
                           
                           local prfolder Prj_akp53
                           include DIRECTORY_INITIALIZATION.DOH, adopath



                          • #14
                            The scope of the local macros defined in the -include- file is the program in the "master" do file that -include-s it. If the do-file contains programs defined within it or calls other do-files and they need the information in those locals either it must be passed to them as program arguments or the program must also -include- the header file. But within the master do-file (not any program embedded in it or called by it) you do not need to reinitialize. The locals are good throughout that level of the code.
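
                            A small sketch of the scoping rule just described, under stated assumptions: the local output is set directly here (standing in for what the header file would define), and the program name showdir and the path are hypothetical.

                            ```stata
                            * Stand-in for a local that the -include-d header would define
                            local output "D:/project/outputs"

                            capture program drop showdir
                            program define showdir
                                args outdir
                                * The caller's local is not visible inside the program: this prints []
                                display "caller's local seen inside program: [`output']"
                                * The value arrives only because it was passed as an argument
                                display "value passed as argument:           [`outdir']"
                            end

                            * At the do-file level the local is usable anywhere after it is defined
                            display "at do-file level: `output'"
                            showdir "`output'"
                            ```

                            This assumes the do-file is run as a whole. Lines selected in the Do-file Editor and executed on their own are run as a separate temporary do-file (as the do "...STD5478_000000.tmp" line in #10 suggests), so locals defined by an earlier selection would not be expected to carry over to a later one.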

                            Last edited by Clyde Schechter; 04 Oct 2023, 21:54.



                            • #15
                              Originally posted by Clyde Schechter
                              The scope of the local macros defined in the -include- file is the program in the "master" do file that -include-s it. If the do-file contains programs defined within it or calls other do-files and they need the information in those locals either it must be passed to them as program arguments or the program must also -include- the header file. But within the master do-file (not any program embedded in it or called by it) you do not need to reinitialize. The locals are good throughout that level of the code.
                              Just so I understand what you are saying: any non-initialization -include- files used in the "master" do-file need to carry this initialization header information within them, so that the chain of initialization is not broken within the "master" do-file. I can certainly fix all my other include files to carry the header info. Just to check the premise, I have the code below in my "master" do-file, merely changing to three directories one after the other without calling any other program or -include- files. The middle segment will not run by itself (even though I had initially run the header segment with the first 3 lines). Does this reflect the intended behavior of the initialization code?

                              Which makes me wonder what the initialization exercise achieved. I could simply point directly to directories with a full path in a single line for every instance without calling the initialization, no?


                              Code:
                              // HEADER INCLUDED : changes to working dir
                              *------------------------------------------
                              local prfolder Prj_akp53
                              include DIRECTORY_INITIALIZATION.DOH, adopath
                              cd "`working'"   // generic includes folder location
                              
                              
                              // Next line run as a single line (after having run the first 3 lines separately) : r170 error.
                              *--------------------------------------------------------------------------------------------------
                              cd "`prfolder'/dofiles"   // project specific includes folder
                              
                              
                              // REINITIALIZED Header, the cd works again.
                              *----------------------------------
                              local prfolder Prj_akp53
                              include DIRECTORY_INITIALIZATION.DOH, adopath
                              cd "`working'"  // exel import folder

